
    Scaling column imprints using advanced vectorization


    Column Imprints: A Secondary Index Structure

    Large-scale data warehouses rely heavily on secondary indexes, such as bitmaps and B-trees, to limit access to slow I/O devices. However, with the advent of large main-memory systems, cache-conscious secondary indexes are also needed to improve the transfer bandwidth between memory and CPU. In this paper, we introduce the column imprint, a simple but efficient cache-conscious secondary index. A column imprint is a collection of many small bit vectors, each indexing the data points of a single cacheline. An imprint is used during query evaluation to limit data access and thus minimize memory traffic. The compression for imprints is CPU-friendly and exploits the empirical observation that data often exhibits local clustering or partial ordering as a side effect of the construction process. Most importantly, column imprint compression remains effective and robust even in the case of unclustered data, where other state-of-the-art solutions fail. We conducted an extensive experimental evaluation to assess the applicability and the performance impact of column imprints. The storage overhead, when experimenting with real-world datasets, is just a few percent over the size of the columns being indexed. The evaluation time for over 40000 range queries of varying selectivity demonstrates the efficiency of the proposed index compared with state-of-the-art alternatives.
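
    To make the description concrete, here is a minimal Python sketch of the imprint idea as stated in the abstract: one small bit vector per cacheline, with each bit marking a histogram bin touched by that cacheline's values. The 64-values-per-cacheline constant, the bin boundaries, and all function names are illustrative assumptions, not MonetDB's actual implementation.

```python
# Minimal column-imprint sketch (illustrative, not MonetDB's code).
import bisect

VALUES_PER_CACHELINE = 64   # assumed number of values covered by one imprint

def build_imprints(column, bin_bounds):
    """One small bit vector per cacheline: bit i is set iff the cacheline
    contains at least one value that falls into histogram bin i."""
    imprints = []
    for start in range(0, len(column), VALUES_PER_CACHELINE):
        mask = 0
        for v in column[start:start + VALUES_PER_CACHELINE]:
            mask |= 1 << bisect.bisect_right(bin_bounds, v)
        imprints.append(mask)
    return imprints

def range_query_mask(lo, hi, bin_bounds):
    """Bitmask of all histogram bins a [lo, hi] range predicate can touch."""
    mask = 0
    for b in range(bisect.bisect_right(bin_bounds, lo),
                   bisect.bisect_right(bin_bounds, hi) + 1):
        mask |= 1 << b
    return mask

def candidate_cachelines(imprints, query_mask):
    """Only cachelines whose imprint intersects the query mask need scanning."""
    return [i for i, m in enumerate(imprints) if m & query_mask]

bounds = [10, 20, 30, 40, 50, 60, 70]        # 7 boundaries -> 8 bins
imps = build_imprints(list(range(1000)), bounds)
print(candidate_cachelines(imps, range_query_mask(15, 25, bounds)))  # -> [0]
```

    Cachelines whose imprint does not intersect the query mask are skipped entirely, which is how the index limits data access and memory traffic during query evaluation.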

    Heuristics-based query optimisation for SPARQL

    Query optimization in RDF stores is a challenging problem, as SPARQL queries typically contain many more joins than equivalent relational plans and hence lead to a large join-order search space. In such cases, cost-based query optimization often is not possible. One practical reason for this is that statistics are typically missing in web-scale settings such as Linked Open Data (LOD). The more profound reason is that, due to the absence of schematic structure in RDF, join-hit-ratio estimation requires complicated forms of correlated join statistics, and currently there are no methods to identify the relevant correlations beforehand. For this reason, the use of good heuristics is essential in SPARQL query optimization, even when they are only partially combined with cost-based statistics (i.e., hybrid query optimization). In this paper we describe a set of useful heuristics for SPARQL query optimizers. We present these in the context of a new Heuristic SPARQL Planner (HSP) that is capable of exploiting the syntactic and structural variations of the triple patterns in a SPARQL query in order to choose an execution plan without the need for any cost model. For this, we define the variable graph and we show a reduction of the SPARQL query optimization problem to the maximum weight independent set problem. We implemented our planner on top of the MonetDB open-source column-store and evaluated its effectiveness against the state-of-the-art RDF-3X engine, as well as comparing the plan quality with a relational (SQL) equivalent of the benchmarks.
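
    As a rough illustration of the variable graph mentioned above, the Python sketch below builds a graph whose nodes are the query variables and whose edges connect variables that co-occur in a triple pattern (i.e., potential joins). The edge weight used here, the number of shared patterns, is an assumption made for the example; it is not necessarily the weighting HSP uses before solving the maximum weight independent set problem.

```python
# Illustrative variable-graph construction over SPARQL triple patterns.
from collections import defaultdict
from itertools import combinations

def is_var(term):
    return term.startswith("?")

def variable_graph(triple_patterns):
    """Nodes are query variables; an edge links two variables that appear
    in the same triple pattern. Edge weight = number of shared patterns."""
    edges = defaultdict(int)
    for s, p, o in triple_patterns:
        variables = sorted({t for t in (s, p, o) if is_var(t)})
        for a, b in combinations(variables, 2):
            edges[(a, b)] += 1
    return dict(edges)

# SELECT ?x ?n WHERE { ?x :knows ?y . ?y :name ?n . }
patterns = [("?x", ":knows", "?y"), ("?y", ":name", "?n")]
print(variable_graph(patterns))   # {('?x', '?y'): 1, ('?n', '?y'): 1}
```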

    A database system with amnesia

    Big Data comes with huge challenges. Its volume and velocity make handling, curating, and analytical processing a costly affair. Even simply “looking at” the data within an a priori defined budget and with a guaranteed interactive response time might be impossible to achieve. Commonly applied scale-out approaches will hit the technology and monetary wall soon, if they have not done so already. Likewise, blindly rejecting data when the channels are full, or reducing the data resolution at the source, might lead to the loss of valuable observations. An army of well-educated database administrators or full software-stack architects might deal with these challenges, albeit at substantial cost. This calls for a mostly knobless DBMS with a fundamental change in database management. Data rotting has been proposed as a direction towards a solution [10, 11]. For the sake of storage management and responsiveness, it lets the DBMS semi-autonomously rot away data. Rotting is based on the system's own unwillingness to keep old data as easily accessible as fresh data. This paper sheds more light on the opportunities and potential impacts of this radical departure in data management. Specifically, we study the case where a DBMS selectively forgets tuples (by marking them inactive) under various amnesia scenarios and with different implementation strategies. Our ultimate goal is to use the findings of this study to morph an existing data management engine to serve demanding big data scientific applications with well-chosen built-in data amnesia.
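
    The Python sketch below illustrates one possible reading of "selectively forgetting tuples by marking them inactive": a table that is over its storage budget rots away its least recently accessed rows. The capacity threshold and the least-recently-accessed policy are assumptions chosen for the example, not the specific amnesia scenarios studied in the paper.

```python
# Hypothetical sketch of an "amnesic" table that marks cold tuples inactive.
import time
from dataclasses import dataclass, field

@dataclass
class Row:
    payload: dict
    last_access: float = field(default_factory=time.time)
    active: bool = True

class AmnesicTable:
    def __init__(self, capacity):
        self.capacity = capacity      # tuples we are willing to keep active
        self.rows = []

    def insert(self, payload):
        self.rows.append(Row(payload))
        self._rot()

    def scan(self, predicate):
        """Queries only see active tuples; touching a tuple refreshes it."""
        hits = []
        for r in self.rows:
            if r.active and predicate(r.payload):
                r.last_access = time.time()
                hits.append(r.payload)
        return hits

    def _rot(self):
        """Semi-autonomously mark the coldest tuples inactive when over budget."""
        active = [r for r in self.rows if r.active]
        excess = len(active) - self.capacity
        if excess <= 0:
            return
        for r in sorted(active, key=lambda r: r.last_access)[:excess]:
            r.active = False
```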

    Column-store support for RDF data management: not all swans are white

    This paper reports on the results of an independent evaluation of the techniques presented in the VLDB 2007 paper "Scalable Semantic Web Data Management Using Vertical Partitioning", authored by D. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. We revisit the proposed benchmark and examine both the data and query space coverage. The benchmark is extended to cover a larger portion of the query space in a canonical way. Repeatability of the experiments is assessed using the code base obtained from the authors. Inspired by the proposed vertically-partitioned storage solution for RDF data and the performance figures obtained with a column-store, we conduct a complementary analysis of state-of-the-art RDF storage solutions. To this end, we employ MonetDB/SQL, a fully functional open-source column-store, and a commercial row-store DBMS well known for its performance. We implement two relational RDF storage solutions, a triple-store and a vertically-partitioned store, in both systems. This allows us to expand the scope of the original study with a performance characterization along both dimensions, triple-store vs. vertically-partitioned and row-store vs. column-store, individually, before analyzing their combined effects. A detailed report of the experimental test-bed, as well as an in-depth analysis of the parameters involved, clarifies the scope of the solution originally presented and positions the results in a broader context by covering more systems.
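
    For readers unfamiliar with the two relational RDF layouts being compared, the short Python sketch below contrasts a single triple table with the vertically-partitioned layout (one two-column table per predicate). Plain Python structures stand in here for the SQL tables used in the actual systems.

```python
# Illustrative contrast of the two RDF storage layouts discussed above.
def triple_store(triples):
    """One wide table of (subject, predicate, object) rows."""
    return list(triples)

def vertically_partitioned(triples):
    """One (subject, object) table per predicate, as in vertical partitioning."""
    tables = {}
    for s, p, o in triples:
        tables.setdefault(p, []).append((s, o))
    return tables

triples = [("ex:book1", "dc:title", "'SPARQL Basics'"),
           ("ex:book1", "dc:creator", "ex:alice"),
           ("ex:book2", "dc:title", "'Column Stores'")]

# A query touching only dc:title scans one narrow table here, instead of
# filtering the full triple table on its predicate column.
print(vertically_partitioned(triples)["dc:title"])
# -> [('ex:book1', "'SPARQL Basics'"), ('ex:book2', "'Column Stores'")]
```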

    Scientific discovery through weighted sampling

    Scientific discovery has shifted from being an exercise of theory and computation to becoming the exploration of an ocean of observational data. Scientists explore data originating from modern scientific instruments in order to discover interesting aspects of it and formulate their hypotheses. Such workloads press for new database functionality. We aim at sampling scientific databases to create many different impressions of the data, on which the scientists can quickly evaluate exploratory queries. However, scientific databases introduce different challenges for sample construction compared to classical business analytical applications. We propose adaptive weighted sampling as an alternative to uniform sampling. With weighted sampling only the most informative data is sampled, so that data more relevant to the scientific discovery is available for examining a hypothesis. Relevant data is considered to be the focal points of the scientific search and can be defined either a priori with the use of weight functions or by monitoring the query workload. We study such query workloads and detail different families of weight functions. Finally, we give a quantitative and qualitative evaluation of weighted sampling.
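
    A minimal Python sketch of the weighted-sampling idea follows: a weight function concentrates the sample around an assumed focal point of the exploration instead of drawing every tuple with equal probability. The Gaussian weight function and its parameters are illustrative choices, not the specific weight-function families studied in the paper.

```python
# Illustrative weighted (non-uniform) sampling around a focal point.
import math
import random

def gaussian_weight(value, focus, spread):
    """Higher weight for values near the current focal point of the search."""
    return math.exp(-((value - focus) ** 2) / (2 * spread ** 2))

def weighted_sample(rows, key, weight_fn, k, rng=random.Random(0)):
    """Draw k rows with probability proportional to weight_fn(key(row)).
    rng.choices samples with replacement, which is fine for a quick impression."""
    weights = [weight_fn(key(r)) for r in rows]
    return rng.choices(rows, weights=weights, k=k)

rows = [{"id": i, "magnitude": i / 10.0} for i in range(1000)]
sample = weighted_sample(rows, key=lambda r: r["magnitude"],
                         weight_fn=lambda v: gaussian_weight(v, focus=42.0, spread=5.0),
                         k=100)
# The sample over-represents rows whose magnitude is near the assumed focal
# point (42.0), the kind of data "impression" the abstract describes.
```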